{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第7章 类别变量分析"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 初始化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 本章需要用到的库\n",
    "import numpy as np # 导入numpy库\n",
    "import pandas as pd # 导入pandas库\n",
    "from scipy.stats import chisquare, chi2_contingency # 导入卡方检验函数, 卡方独立性检验函数"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.1 一个类别变量的拟合优度检验"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.1.1 期望频数相等"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>饮料类型</th>\n",
       "      <th>人数</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>碳酸饮料</td>\n",
       "      <td>525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>矿泉水</td>\n",
       "      <td>550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>果汁</td>\n",
       "      <td>470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>其他</td>\n",
       "      <td>455</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   饮料类型   人数\n",
       "0  碳酸饮料  525\n",
       "1   矿泉水  550\n",
       "2    果汁  470\n",
       "3    其他  455"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example7_1 = pd.read_csv('./pydata/chap07/example7_1.csv', encoding='gbk') # 读取数据\n",
    "example7_1"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "检验消费者对不同类型饮料的偏好是否有显著差异(α=0.05)\n",
    "- $H_0:$ 观察频数与期望频数无显著差异(无显著偏好)\n",
    "- $H_1:$ 观察频数与期望频数有显著差异(有显著偏好)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 12.10, p值 = 0.007048\n"
     ]
    }
   ],
   "source": [
    "chi2, p_value = chisquare(f_obs=example7_1['人数'])\n",
    "print(f'卡方统计量 = {chi2:.2f}, p值 = {p_value:.4g}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P<0.05, 所以拒绝原假设"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.1.2 期望频数不相等"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>受教育程度</th>\n",
       "      <th>离婚家庭数</th>\n",
       "      <th>期望比例</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>小学及以下</td>\n",
       "      <td>30</td>\n",
       "      <td>0.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>初中</td>\n",
       "      <td>110</td>\n",
       "      <td>0.35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>高中</td>\n",
       "      <td>80</td>\n",
       "      <td>0.25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>大学</td>\n",
       "      <td>25</td>\n",
       "      <td>0.12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>研究生</td>\n",
       "      <td>15</td>\n",
       "      <td>0.08</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   受教育程度  离婚家庭数  期望比例\n",
       "0  小学及以下     30  0.20\n",
       "1     初中    110  0.35\n",
       "2     高中     80  0.25\n",
       "3     大学     25  0.12\n",
       "4    研究生     15  0.08"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example7_2 = pd.read_csv('./pydata/chap07/example7_2.csv', encoding='gbk') # 读取数据\n",
    "example7_2"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提出假设:\n",
    "- $H_0:$ 不同受教育程度的离婚家庭数与期望频数无显著差异\n",
    "- $H_1:$ 不同受教育程度的离婚家庭数与期望频数有显著差异"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 19.59, p值 = 0.0006028\n"
     ]
    }
   ],
   "source": [
    "chi2, p_value = chisquare(\n",
    "    f_obs=example7_2['离婚家庭数'], \n",
    "    f_exp=example7_2['离婚家庭数'].sum() * example7_2['期望比例']\n",
    ")\n",
    "print(f'卡方统计量 = {chi2:.2f}, p值 = {p_value:.4g}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P<0.05, 所以拒绝原假设"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.2 两个类别变量的独立性检验"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.2.1 列联表与$\\chi^2$独立性检验"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>东部</th>\n",
       "      <th>中部</th>\n",
       "      <th>西部</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>满意</th>\n",
       "      <td>126</td>\n",
       "      <td>158</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>不满意</th>\n",
       "      <td>34</td>\n",
       "      <td>82</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      东部   中部  西部\n",
       "满意   126  158  35\n",
       "不满意   34   82  65"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = [[126, 158, 35], [34, 82, 65]]\n",
    "df = pd.DataFrame(x, index=['满意', '不满意'], columns=['东部', '中部', '西部'])\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提出假设:\n",
    "- $H_0:$ 满意度与地区独立\n",
    "- $H_1:$ 满意度与地区不独立"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 51.827, 自由度 = 2, p值 = 5.572e-12\n",
      "期望频数为: \n",
      " [[102.08 153.12  63.8 ]\n",
      " [ 57.92  86.88  36.2 ]]\n"
     ]
    }
   ],
   "source": [
    "chi2, p_value, dof, exp = chi2_contingency(df)\n",
    "print(f'卡方统计量 = {chi2:.3f}, 自由度 = {dof}, p值 = {p_value:.4g}')\n",
    "print('期望频数为: \\n', exp)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P<0.05, 所以拒绝原假设"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>地区</th>\n",
       "      <th>满意度</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>东部</td>\n",
       "      <td>满意</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>东部</td>\n",
       "      <td>满意</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>东部</td>\n",
       "      <td>满意</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>东部</td>\n",
       "      <td>满意</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>东部</td>\n",
       "      <td>满意</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   地区 满意度\n",
       "0  东部  满意\n",
       "1  东部  满意\n",
       "2  东部  满意\n",
       "3  东部  满意\n",
       "4  东部  满意"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 使用原始数据进行检验\n",
    "example7_3 = pd.read_csv('./pydata/chap07/example7_3.csv', encoding='gbk') # 读取数据\n",
    "example7_3.head() # 显示前5行数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>地区</th>\n",
       "      <th>东部</th>\n",
       "      <th>中部</th>\n",
       "      <th>西部</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>满意度</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>不满意</th>\n",
       "      <td>34</td>\n",
       "      <td>82</td>\n",
       "      <td>65</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>满意</th>\n",
       "      <td>126</td>\n",
       "      <td>158</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "地区    东部   中部  西部\n",
       "满意度              \n",
       "不满意   34   82  65\n",
       "满意   126  158  35"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count = pd.crosstab(index = example7_3['满意度'], columns = example7_3['地区'])\n",
    "count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 51.827, 自由度 = 2, p值 = 5.572e-12\n",
      "期望频数为: \n",
      " [[ 57.92  86.88  36.2 ]\n",
      " [102.08 153.12  63.8 ]]\n"
     ]
    }
   ],
   "source": [
    "chi2, p_value, dof, exp = chi2_contingency(count)\n",
    "print(f'卡方统计量 = {chi2:.3f}, 自由度 = {dof}, p值 = {p_value:.4g}')\n",
    "print('期望频数为: \\n', exp)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P<0.05, 所以拒绝原假设"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.2.2 应用$\\chi^2$检验的注意事项"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7.3 两个类别变量的相关性度量"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.3.1 $\\varphi$系数和Cramer's V系数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "phi系数 = 0.32195, v系数 = 0.32195\n"
     ]
    }
   ],
   "source": [
    "n = 500 # 样本容量\n",
    "chi2 = 51.827 # 卡方统计量 使用例7-3中的卡方统计量\n",
    "phi = np.sqrt(chi2 / n) # 计算phi系数\n",
    "v = np.sqrt(chi2 / (n * (2 - 1))) # 计算Cramer's V系数\n",
    "print(f'phi系数 = {phi:.5f}, v系数 = {v:.5f}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.3.2 列联系数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "c系数 = 0.30646\n"
     ]
    }
   ],
   "source": [
    "c = np.sqrt(chi2 / (chi2 + n)) # 计算列联系数\n",
    "print(f'c系数 = {c:.5f}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>月份</th>\n",
       "      <th>销售量</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1月</td>\n",
       "      <td>1660</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2月</td>\n",
       "      <td>1600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3月</td>\n",
       "      <td>1560</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4月</td>\n",
       "      <td>1490</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5月</td>\n",
       "      <td>1380</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6月</td>\n",
       "      <td>1620</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7月</td>\n",
       "      <td>1580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8月</td>\n",
       "      <td>1680</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>9月</td>\n",
       "      <td>1550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>10月</td>\n",
       "      <td>1370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>11月</td>\n",
       "      <td>1410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>12月</td>\n",
       "      <td>1610</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     月份   销售量\n",
       "0    1月  1660\n",
       "1    2月  1600\n",
       "2    3月  1560\n",
       "3    4月  1490\n",
       "4    5月  1380\n",
       "5    6月  1620\n",
       "6    7月  1580\n",
       "7    8月  1680\n",
       "8    9月  1550\n",
       "9   10月  1370\n",
       "10  11月  1410\n",
       "11  12月  1610"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('./exercise/chap07/exercise7_1.csv', encoding='gbk') # 读取数据\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提出假设:\n",
    "- $H_0:$ 观测频数与期望频数无显著差异(符合均匀分布)\n",
    "- $H_1:$ 观测频数与期望频数有显著差异(不符合均匀分布)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 80.92, p值 = 9.778e-13\n"
     ]
    }
   ],
   "source": [
    "chi2, p_value = chisquare(f_obs=df['销售量'])\n",
    "print(f'卡方统计量 = {chi2:.2f}, p值 = {p_value:.4g}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P<0.05, 所以拒绝原假设"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>男</th>\n",
       "      <th>女</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>是否逃课</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>逃过课</th>\n",
       "      <td>34</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>未逃过课</th>\n",
       "      <td>28</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       男   女\n",
       "是否逃课        \n",
       "逃过课   34  38\n",
       "未逃过课  28  50"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('./exercise/chap07/exercise7_2.csv', encoding='gbk') # 读取数据\n",
    "df = df.set_index('是否逃课')\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提出假设:\n",
    "- $H_0:$ 是否逃课与性别独立\n",
    "- $H_1:$ 是否逃课与性别不独立"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 1.541, 自由度 = 1, p值 = 0.2145\n",
      "期望频数为: \n",
      " [[29.76 42.24]\n",
      " [32.24 45.76]]\n"
     ]
    }
   ],
   "source": [
    "chi2, p_value, dof, exp = chi2_contingency(df)\n",
    "print(f'卡方统计量 = {chi2:.3f}, 自由度 = {dof}, p值 = {p_value:.4g}')\n",
    "print('期望频数为: \\n', exp)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P>0.05, 所以不能拒绝原假设"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>关注</th>\n",
       "      <th>不关注</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>上市公司的类型</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>主板企业</th>\n",
       "      <td>50</td>\n",
       "      <td>70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>中小板企业</th>\n",
       "      <td>30</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>创业板企业</th>\n",
       "      <td>20</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         关注  不关注\n",
       "上市公司的类型         \n",
       "主板企业     50   70\n",
       "中小板企业    30   15\n",
       "创业板企业    20    5"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('./exercise/chap07/exercise7_3.csv', encoding='gbk') # 读取数据\n",
    "df = df.set_index('上市公司的类型')\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提出假设:\n",
    "- $H_0:$ 上市公司类型与对股价波动的关注程度独立\n",
    "- $H_1:$ 上市公司类型与对股价波动的关注程度不独立"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 16.854, 自由度 = 2, p值 = 0.0002189\n",
      "期望频数为: \n",
      " [[63.15789474 56.84210526]\n",
      " [23.68421053 21.31578947]\n",
      " [13.15789474 11.84210526]]\n"
     ]
    }
   ],
   "source": [
    "# (1) 检验上市公司类型与对股价波动的关注程度是否独立(α=0.05)。\n",
    "chi2, p_value, dof, exp = chi2_contingency(df)\n",
    "print(f'卡方统计量 = {chi2:.3f}, 自由度 = {dof}, p值 = {p_value:.4g}')\n",
    "print('期望频数为: \\n', exp)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P<0.05, 所以拒绝原假设"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "phi系数 = 0.29783, v系数 = 0.29783, c系数 = 0.28544\n"
     ]
    }
   ],
   "source": [
    "# (2) 计算上市公司类型与对股价波动的关注程度两个变量之间的phi系数、Cramer's V系数和列联系数，并分析其相关程度。\n",
    "n = 190 # 样本容量\n",
    "phi = np.sqrt(chi2 / n) # 计算phi系数\n",
    "v = np.sqrt(chi2 / (n * (2 - 1))) # 计算Cramer's V系数\n",
    "c = np.sqrt(chi2 / (chi2 + n)) # 计算列联系数\n",
    "print(f'phi系数 = {phi:.5f}, v系数 = {v:.5f}, c系数 = {c:.5f}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "三个系数的结果都表明上市公司类型与对股价波动的关注程度两个变量之间存在一定程度的相关性"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>东部地区</th>\n",
       "      <th>中部地区</th>\n",
       "      <th>西部地区</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>汽车价格</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10万元以下</th>\n",
       "      <td>20</td>\n",
       "      <td>40</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10—20万元</th>\n",
       "      <td>50</td>\n",
       "      <td>60</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20—30万元</th>\n",
       "      <td>30</td>\n",
       "      <td>20</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30万元以上</th>\n",
       "      <td>40</td>\n",
       "      <td>20</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         东部地区  中部地区  西部地区\n",
       "汽车价格                     \n",
       "10万元以下     20    40    40\n",
       "10—20万元    50    60    50\n",
       "20—30万元    30    20    20\n",
       "30万元以上     40    20    10"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv('./exercise/chap07/exercise7_4.csv', encoding='gbk') # 读取数据\n",
    "df = df.set_index('汽车价格')\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提出假设:\n",
    "- $H_0:$ 地区与汽车价格独立\n",
    "- $H_1:$ 地区与汽车价格不独立"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "卡方统计量 = 29.991, 自由度 = 6, p值 = 3.946e-05\n",
      "期望频数为: \n",
      " [[35.  35.  30. ]\n",
      " [56.  56.  48. ]\n",
      " [24.5 24.5 21. ]\n",
      " [24.5 24.5 21. ]]\n"
     ]
    }
   ],
   "source": [
    "# (1) 检验地区与汽车价格是否独立(α=0.05)。\n",
    "chi2, p_value, dof, exp = chi2_contingency(df)\n",
    "print(f'卡方统计量 = {chi2:.3f}, 自由度 = {dof}, p值 = {p_value:.4g}')\n",
    "print('期望频数为: \\n', exp)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于P<0.05, 所以拒绝原假设"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "phi系数 = 0.27382, v系数 = 0.19362, c系数 = 0.26410\n"
     ]
    }
   ],
   "source": [
    "# (2) 计算地区与汽车价格两个变量之间的phi系数、Cramer's V系数和列联系数，并分析其相关程度。\n",
    "n = 400 # 样本容量\n",
    "phi = np.sqrt(chi2 / n) # 计算phi系数\n",
    "v = np.sqrt(chi2 / (n * (3 - 1))) # 计算Cramer's V系数\n",
    "c = np.sqrt(chi2 / (chi2 + n)) # 计算列联系数\n",
    "print(f'phi系数 = {phi:.5f}, v系数 = {v:.5f}, c系数 = {c:.5f}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "三个系数的结果都表明地区与汽车价格两个变量之间存在一定程度的相关性"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
